Abstract
Graphical User Interface (GUI) grounding maps natural language instructions to precise interface locations for autonomous interaction. Current reinforcement learning approaches use binary rewards that treat elements as hit-or-miss targets, creating sparse signals that ignore the continuous nature of spatial interactions. Motivated by human clicking behavior that naturally forms Gaussian distributions centered on target elements, we introduce GUI Gaussian Grounding Rewards (GUI-G$^2$), a principled reward framework that models GUI elements as continuous Gaussian distributions across the interface plane. GUI-G$^2$ incorporates two synergistic mechanisms: Gaussian point rewards model precise localization through exponentially decaying distributions centered on element centroids, while coverage rewards assess spatial alignment by measuring the overlap between predicted Gaussian distributions and target regions. To handle diverse element scales, we develop an adaptive variance mechanism that calibrates reward distributions based on element dimensions. This framework transforms GUI grounding from sparse binary classification to dense continuous optimization, where Gaussian distributions generate rich gradient signals that guide models toward optimal interaction positions. Extensive experiments across the ScreenSpot, ScreenSpot-v2, and ScreenSpot-Pro benchmarks demonstrate that GUI-G$^2$ substantially outperforms the state-of-the-art method UI-TARS-72B, with the most significant improvement of 24.7% on ScreenSpot-Pro. Our analysis reveals that continuous modeling provides superior robustness to interface variations and enhanced generalization to unseen layouts, establishing a new paradigm for spatial reasoning in GUI interaction tasks.
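To make the two reward terms concrete, the following is a minimal Python sketch of one possible reading of the description above; it is not the implementation evaluated in this work. The function names, the `(x1, y1, x2, y2)` box format, the scale factor `alpha`, and the choice to measure coverage as the predicted Gaussian's probability mass inside the target box are all illustrative assumptions.

```python
import math

EPS = 1e-6  # guards against zero-width or zero-height boxes


def adaptive_sigma(box, alpha=0.5):
    """Per-axis standard deviations proportional to the box size.

    `alpha` is a hypothetical scale factor, not a value from the paper.
    """
    x1, y1, x2, y2 = box
    return max(alpha * (x2 - x1), EPS), max(alpha * (y2 - y1), EPS)


def gaussian_point_reward(point, target_box, alpha=0.5):
    """Reward that decays exponentially with distance from the target centroid."""
    x1, y1, x2, y2 = target_box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    sx, sy = adaptive_sigma(target_box, alpha)
    px, py = point
    return math.exp(-0.5 * (((px - cx) / sx) ** 2 + ((py - cy) / sy) ** 2))


def _normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def coverage_reward(pred_box, target_box, alpha=0.5):
    """Assumed overlap measure: mass of the predicted axis-aligned Gaussian
    (centered on the predicted box) that falls inside the target box."""
    px1, py1, px2, py2 = pred_box
    mx, my = (px1 + px2) / 2, (py1 + py2) / 2
    sx, sy = adaptive_sigma(pred_box, alpha)
    tx1, ty1, tx2, ty2 = target_box
    mass_x = _normal_cdf((tx2 - mx) / sx) - _normal_cdf((tx1 - mx) / sx)
    mass_y = _normal_cdf((ty2 - my) / sy) - _normal_cdf((ty1 - my) / sy)
    return mass_x * mass_y


if __name__ == "__main__":
    target = (0, 0, 100, 40)
    for click in [(50, 20), (60, 20), (90, 20), (150, 20)]:
        print(click, round(gaussian_point_reward(click, target), 4))
    print("coverage:", round(coverage_reward((10, 5, 110, 45), target), 4))
```

Under these assumptions, a click just outside the element still receives a small positive reward instead of zero, which is the dense signal that a binary hit-or-miss reward lacks; because the standard deviations scale with element size, the same pixel offset is penalized more heavily on a small icon than on a wide button.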